Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study

Authors

  • Beth Ann Hockey
  • Manny Rayner
  • Gwen Christian
Abstract

Statistical language models (SLMs) for speech recognition have the advantage of robustness, and grammar-based language models (GLMs) the advantage that they can be built even when little corpus data is available. A known way to attempt to combine these two methodologies is first to create a GLM, and then use that GLM to generate training data for an SLM. It has, however, been difficult to evaluate the true utility of the idea, since the corpus data used to create the GLM has not in general been explicitly available. We exploit the Open Source Regulus platform, which supports corpus-based construction of linguistically motivated GLMs, to perform a methodologically sound comparison: the same data is used both to create an SLM directly, and also to create a GLM, which is then used to generate data to train an SLM. An evaluation on a medium-vocabulary task showed that the indirect method of constructing the SLM is in fact only marginally better than the direct one. The method used to create the training data is critical, with PCFG generation heavily outscoring CFG generation.
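
To make the contrast in the last sentence concrete, below is a minimal, self-contained Python/NLTK sketch of the two ways of manufacturing SLM training text from a grammar: plain CFG generation, shown here as enumeration of the licensed strings with no frequency information, and PCFG generation, shown here as sampling strings in proportion to rule probabilities estimated from a corpus. The toy grammar, the rule probabilities, and the generation procedures are invented for illustration only; they are not the Regulus grammars or the exact generation methods used in the paper.

import random

import nltk
from nltk.grammar import Nonterminal
from nltk.parse.generate import generate

# Toy domain grammar (invented for illustration; not the Regulus grammar).
CFG_RULES = """
S   -> V NP | V NP PP
NP  -> Det N
PP  -> P NP
V   -> 'open' | 'close'
Det -> 'the'
N   -> 'valve' | 'panel'
P   -> 'behind'
"""

# The same grammar with hypothetical corpus-estimated rule probabilities.
PCFG_RULES = """
S   -> V NP [0.8] | V NP PP [0.2]
NP  -> Det N [1.0]
PP  -> P NP [1.0]
V   -> 'open' [0.6] | 'close' [0.4]
Det -> 'the' [1.0]
N   -> 'valve' [0.7] | 'panel' [0.3]
P   -> 'behind' [1.0]
"""

def cfg_corpus(max_depth=6, limit=1000):
    """CFG generation: enumerate up to `limit` strings the grammar licenses,
    with no frequency information attached."""
    grammar = nltk.CFG.fromstring(CFG_RULES)
    return [" ".join(tokens) for tokens in generate(grammar, depth=max_depth, n=limit)]

def _sample(grammar, symbol=None):
    """Expand one symbol top-down, choosing each rule with its PCFG probability."""
    symbol = grammar.start() if symbol is None else symbol
    if not isinstance(symbol, Nonterminal):
        return [symbol]                                   # terminal: emit the word
    productions = grammar.productions(lhs=symbol)
    chosen = random.choices(productions,
                            weights=[p.prob() for p in productions], k=1)[0]
    words = []
    for part in chosen.rhs():
        words.extend(_sample(grammar, part))
    return words

def pcfg_corpus(n_sentences=10000):
    """PCFG generation: sample sentences whose distribution reflects the
    (hypothetical) corpus-estimated rule probabilities."""
    grammar = nltk.PCFG.fromstring(PCFG_RULES)
    return [" ".join(_sample(grammar)) for _ in range(n_sentences)]

if __name__ == "__main__":
    # Either generated corpus would then be passed to a standard toolkit for n-gram SLM training.
    print("CFG-generated training text:", cfg_corpus()[:3])
    print("PCFG-sampled training text: ", pcfg_corpus(3))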

Similar articles

Improving out-of-coverage language modelling in a multimodal dialogue system using small training sets

For automatic speech recognition, the construction of an adequate language model may be difficult when only a limited amount of training text is available. Previous work has shown that in the case of small training sets, statistical language models may outperform grammars on out-of-coverage utterances, while showing comparable performance on in-coverage input. In this paper, we compare the perfor...

Comparative Impacts of Mindsettings on EFL Learners' Grammar Achievement

The present study was conducted to investigate the comparative impacts of three types of EFL teachers' mindsettings on EFL learners' grammar achievement. The participants of the study were English Translation undergraduate students (both female and male with the age ranging of 18-35) who were selected according to convenience non-random sampling from three classes of English Grammar 1 at both...

Teacher Language Awareness from the Procedural Perspective: The Case of Novice versus Experienced EFL Teachers

Despite the abundance of research on ELT teachers, little is known about teacher language awareness (TLA) with focus on its impact on pedagogical practice in the EFL context. To fill this gap, an in-depth study was conducted to examine the procedural dimension of TLA among eight EFL teachers with different teaching experiences (novice versus experienced) related to teaching grammar at Iranian l...

NE Tagging for Urdu based on Bootstrap POS Learning

Part of Speech (POS) tagging and Named Entity (NE) tagging have become important components of effective text analysis. In this paper, we propose a bootstrapped model that involves four levels of text processing for Urdu. We show that increasing the training data for POS learning by applying bootstrapping techniques improves NE tagging results. Our model overcomes the limitation imposed by the ...

Journal title:

Volume   Issue

Pages  -

Publication date: 2008